\begin{aosachapter}{Zotonic}{s:zotonic}{Arjan Scherpenisse and Marc Worrell}

\aosasecti{Introduction to Zotonic}

Zotonic is an open source framework for full-stack web development,
all the way from the frontend to the backend. On top of a small set of
core functionalities, it implements a lightweight but extensible
Content Management System. Zotonic's main goal is to make it easy to
create well-performing websites ``out of the box'', so that a website
scales well from the start.

While it shares many features and functionalities with web development
frameworks like Django, Drupal, Ruby on Rails and Wordpress, its main
competitive advantage is the language that Zotonic is powered by:
Erlang. This language, originally developed for building phone switches,
allows Zotonic to be fault tolerant and have great performance
characteristics.

As the title says, this chapter focuses on the performance of Zotonic.
We'll look at the reasons why Erlang was chosen as the programming
platform, then inspect the HTTP request stack, and then dive into the
caching strategies that Zotonic employs. Finally, we'll describe the
optimisations we applied to Zotonic's submodules and the database.

\aosasecti{Why Zotonic? Why Erlang?}

The first work on Zotonic was started in 2008, and, like many projects,
came from ``scratching an itch''. Marc Worrell, the main Zotonic
architect, had been working for seven years at Mediamatic Lab, in
Amsterdam, on a Drupal-like CMS written in PHP/MySQL called Anymeta.
Anymeta's main paradigm was that it implemented a ``pragmatic approach
to the Semantic Web'' by modeling everything in the system as generic
``things''. Though successful, its implementations suffered from
scalability problems.

After Marc left Mediamatic, he spent a few months designing a proper,
Anymeta-like CMS from scratch. The main design goals for Zotonic were
that it had to be easy to use for frontend developers; it had to support
easy development of real-time web interfaces, simultaneously allowing
long-lived connections and many short requests; and it had to have
well-defined performance characteristics. More importantly, it had to
solve the most common problems that limited performance in earlier Web
development approaches---for example, it had to withstand the
``Slashdot Effect'' (a sudden rush of visitors).

\aosasectii{Problems with the Classic PHP+Apache Approach}

A classic PHP setup runs as a module inside a container web server like
Apache. On each request, Apache decides how to handle the request. When
it's a PHP request, it spins up \texttt{mod\_php5}, and then the PHP
interpreter starts interpreting the script. This comes with startup
latency: typically, such a spin-up already takes 5 ms, and then the PHP
code still needs to run. This problem can partially be mitigated by
using PHP accelerators which precompile the PHP script, bypassing the
interpreter. The PHP startup overhead can also be mitigated by using a
process manager like PHP-FPM.

Nevertheless, systems like that still suffer from a \emph{shared
nothing} architecture. When a script needs a database connection, it
needs to create one itself. The same goes for any other I/O resource that could
otherwise be shared between requests. Various modules feature persistent
connections to overcome this, but there is no general solution to this
problem in PHP.

Handling long-lived client connections is also hard because such
connections need a separate web server thread or process for every
request. In the case of Apache and PHP-FPM, this does not scale with
many concurrent long-lived connections.

\aosasectii{Requirements for a Modern Web Framework}

Modern web frameworks typically deal with three classes of HTTP request.
First, there are dynamically generated pages: dynamically served,
usually generated by a template processor. Second, there is static
content: small and large files which do not change (e.g., JavaScript,
CSS, and media assets). Third, there are long-lived connections:
WebSockets and long-polling requests for adding interactivity and
two-way communication to pages.

Before creating Zotonic, we were looking for a software framework and
programming language that would allow us to meet our design goals (high
performance, developer friendliness) and sidestep the bottlenecks
associated with traditional web server systems. Ideally the software
would meet the following requirements.

\begin{aosaitemize}

\item
  Concurrent: it needs to support many concurrent connections that are
  not limited by the number of unix processes or OS threads.
\item
  Shared resources: it needs to have a mechanism to share resources
  cheaply (e.g., caching, db connections) between requests.
\item
  Hot code upgrades: for ease of development and the enabling of
  hot-upgrading production systems (keeping downtime to a minimum), it
  would be nice if code changes could be deployed in a running system,
  without needing to restart it.
\item
  Multi-core CPU support: a modern system needs to scale over multiple
  cores, as current CPUs tend to scale in number of cores rather than
  in clock speed.
\item
  Fault tolerant: the system needs to be able to handle exceptional
  situations, ``badly behaving'' code, anomalies or resource starvation.
  Ideally, the system would achieve this by having some kind of
  supervision mechanism to restart the failing parts.
\item
  Distributed: ideally, a system has built-in and easy to set up support
  for distribution over multiple nodes, to allow for better performance
  and protection against hardware failure.
\end{aosaitemize}

\aosasectii{Erlang to the Rescue}

To our knowledge, Erlang was the only language that met these
requirements ``out of the box''. The Erlang VM, combined with its Open
Telecom Platform (OTP), provided the system that gave and continues to
give us all the necessary features.

Erlang is a (mostly) functional programming language and runtime system.
Erlang/OTP applications were originally developed for telephone
switches, and are known for their fault-tolerance and their concurrent
nature. Erlang employs an actor-based concurrency model: each actor is a
lightweight ``process'' (green thread) and the only way to share state
between processes is to pass messages. The Open Telecom Platform is the
set of standard Erlang libraries which enable fault tolerance and
process supervision, among other things.

Fault tolerance is at the core of its programming paradigm: \emph{let it
crash} is the main philosophy of the system. As processes don't share
any state (to share state, they must send messages to each other), their
state is isolated from other processes. As such, a single crashing
process will never take down the system. When a process crashes, its
supervisor process can decide to restart it.

\emph{Let it crash} also allows you to program for the happy case. Using
pattern matching and function guards to assure a sane state means less
error handling code is needed, which usually results in clean, concise,
and readable code.
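A small hypothetical sketch (not actual Zotonic code) shows the style: pattern matching asserts the expected ``happy'' result, and anything else simply crashes the process, leaving recovery to its supervisor.

```erlang
%% Hypothetical example of "programming for the happy case".
%% fetch_age/1 matches only {ok, Age}; an {error, _} result fails
%% the match, crashes this process, and a supervisor may restart it.
-module(happy).
-export([fetch_age/1]).

fetch_age(UserId) when is_integer(UserId), UserId > 0 ->
    {ok, Age} = lookup_age(UserId),   %% no error-handling code needed
    Age.

%% Stub standing in for a real data source.
lookup_age(1) -> {ok, 42};
lookup_age(_) -> {error, not_found}.
```

Calling `happy:fetch_age(2)` raises a `badmatch` error rather than returning an error value the caller must remember to check.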

\aosasecti{Zotonic's Architecture}

Before we discuss Zotonic's performance optimizations, let's have a look
at its architecture. \aosafigref{posa.zotonic.arch} describes Zotonic's
most important components.

\aosafigure[143pt]{zotonic-images/zotonic-architecture.png}{The architecture of Zotonic}{posa.zotonic.arch}

The diagram shows the layers of Zotonic that an HTTP request goes
through. For discussing performance issues we'll need to know what these
layers are, and how they affect performance.

First, Zotonic comes with a built-in web server, Mochiweb (another
Erlang project), so it does not require an external web server. This
keeps the deployment dependencies to a minimum.\footnote{However, it is
  possible to put another web server in front, for example when other
  web systems are running on the same server. But for normal cases, this
  is not needed. It is interesting that a typical optimisation that
  other frameworks use is to put a caching web server such as Varnish in
  front of their application server for serving static files, but for
  Zotonic this does not speed up those requests significantly, as
  Zotonic also caches static files in memory.}

Like many web frameworks, Zotonic uses a URL routing system to match
requests to controllers. Controllers handle each request in a RESTful
way, thanks to the Webmachine library.

Controllers are ``dumb'' on purpose, without much application-specific
logic. Zotonic provides a number of standard controllers which, for the
development of basic web applications, are often good enough. For
instance, there is a \texttt{controller\_template}, whose sole purpose
is to reply to HTTP GET requests by rendering a given template.

The template language is an Erlang implementation of the well-known
Django Template Language, called ErlyDTL. The general principle in
Zotonic is that the templates drive the data requests. The templates
decide which data they need, and retrieve it from the models.

Models expose functions to retrieve data from various data sources, like
a database. Models expose an API to the templates, dictating how they
can be used. The models are also responsible for caching their results
in memory; they decide when and what is cached and for how long. When
templates need data, they call a model as if it were a globally
available variable.

A model is an Erlang wrapper module which is responsible for certain
data. It contains the necessary functions to retrieve and store data in
the way that the application needs. For instance, the central model of
Zotonic is called \texttt{m.rsc}, which provides access to the generic
resource (``page'') data model. Since resources use the database,
\texttt{m\_rsc.erl} uses a database connection to retrieve its data and
pass it through to the template, caching it whenever it can.

This ``templates drive the data'' approach is different from other web
frameworks like Rails and Django, which usually follow a more classical
MVC approach where a controller assigns data to a template. Zotonic
follows a less ``controller-centric'' approach, so that typical websites
can be built by just writing templates.
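A hypothetical template fragment (using the \texttt{m.rsc} model described above; the surrounding markup is illustrative) shows how a template fetches its own data instead of having it assigned by a controller:

```dtl
{# The template asks the m.rsc model for the data it needs. #}
<h1>{{ m.rsc[id].title }}</h1>
<p>{{ m.rsc[id].summary }}</p>
```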

Zotonic uses PostgreSQL for data persistence.
\aosasecref{posa.zotonic.db} explains the rationale for this choice.

\aosasectii{Additional Zotonic Concepts}

While the main focus of this chapter is the performance characteristics
of the web request stack, it is useful to know some of the other
concepts that are at the heart of Zotonic.

\begin{aosadescription}
\item[Virtual hosting]
A single Zotonic instance typically serves more than one site. It is
designed for virtual hosting, including domain aliases and SSL support.
And due to Erlang's process-isolation, a crashing site does not affect
any of the other sites running in the same VM.
\item[Modules]
Modules are Zotonic's way of grouping functionality together. Each
module is in its own directory containing Erlang files, templates,
assets, etc. They can be enabled on a per-site basis. Modules can hook
into the admin system: for instance, the \texttt{mod\_backup} module
adds version control to the page editor and also runs a daily full
database backup. Another module, \texttt{mod\_github}, exposes a
\texttt{webhook} which pulls, rebuilds and reloads a Zotonic site from
github, allowing for continuous deployment.
\item[Notifications]
To enable loose coupling and extensibility, communication between
modules and core components is done through a notification mechanism
which functions either as a map or a fold over the observers of a
certain named notification. By listening to notifications it becomes
easy for a module to override or augment certain behaviour. The calling
function decides whether a map or a fold is used. For instance, the
\texttt{admin\_menu} notification is a fold over the modules which
allows them to add or remove menu items in the admin menu.
\item[Data model]
The main data model that Zotonic uses can be compared to Drupal's Node
module; ``every thing is a thing''. The data model consists of
hierarchically categorized resources which connect to other resources
using labelled edges. Like its source of inspiration, the Anymeta CMS,
this data model is loosely based on the principles of the Semantic Web.
\end{aosadescription}

Zotonic is an extensible system, and all parts of the system add up when
you consider performance. For instance, you might add a module that
intercepts web requests, and does something on each request. Such a
module might impact the performance of the entire system. In this
chapter we'll leave this out of consideration, and instead focus on the
core performance issues.

\aosasecti{Problem Solving: Fighting the Slashdot Effect}

Most web sites live an unexceptional life in a small place somewhere on
the web. That is, until one of their pages hit the front page of a
popular website like CNN, BBC or Yahoo. In that case, the traffic to the
website will likely increase to tens, hundreds, or even thousands of
page requests per second in no time.

Such a sudden surge overloads a traditional web server and makes it
unreachable. The term ``Slashdot Effect'' was named after the web site
whose links first caused this kind of overwhelming referral traffic.
Even worse, an overloaded server is often very hard to restart, as the
newly started server has empty caches, no database connections, and
often uncompiled templates.

Many anonymous visitors requesting exactly the same page around the same
time shouldn't be able to overload a server. This problem is easily
solved using a caching proxy like Varnish, which caches a static copy of
the page and only checks for updates to the page once in a while.

A surge of traffic becomes more challenging when serving dynamic pages
for every single visitor; these can't be cached. With Zotonic, we set
out to solve this problem.

We realized that most web sites have

\begin{aosaitemize}

\item
  a limited number of very popular pages,
\item
  a long tail of far less popular pages, and
\item
  many shared parts on all pages (menu, most read items, news, etc.).
\end{aosaitemize}

\noindent and decided to

\begin{aosaitemize}

\item
  cache hot data in memory so that no communication is needed to
  access it,
\item
  share renderings of templates and sub-templates between requests and
  between pages on the web site, and
\item
  explicitly design the system to prevent overload on server start and
  restart.
\end{aosaitemize}

\aosasectii{Cache Hot Data}

Why fetch data from an external source (database, memcached) when
another request fetched it already a couple of milliseconds ago? We
always cache simple data requests. In the next section the caching
mechanism is discussed in detail.

\aosasectii{Share Rendered Templates and Sub-templates Between Pages}

When rendering a page or included template, a developer can add optional
caching directives. This caches the rendered result for a period of
time.

Caching enables what we call the \emph{memo} functionality: while a
template is being rendered and one or more processes request the same
rendering, the later processes are suspended. When the rendering is
done, all waiting processes are sent the rendering result.

The memoization alone---without any further caching---gives a large
performance boost by drastically reducing the amount of parallel
template processing.
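The memo mechanism can be sketched as follows (hypothetical code, not the actual implementation): a process keeps a map from rendering key to waiting callers; the first request starts the work, later requests are parked, and every waiter receives the result when it arrives.

```erlang
%% Hypothetical sketch of the memo mechanism. A registered process
%% keeps a map of Key => [WaitingPid]. The first request triggers the
%% rendering; later requests for the same key are parked until the
%% result arrives, and then all waiters are answered at once.
loop(Waiting) ->
    receive
        {render, Key, Fun, From} ->
            case maps:find(Key, Waiting) of
                {ok, Pids} ->                  %% rendering in flight
                    loop(Waiting#{Key := [From | Pids]});
                error ->                       %% first request: compute
                    Self = self(),
                    spawn(fun() -> Self ! {result, Key, Fun()} end),
                    loop(Waiting#{Key => [From]})
            end;
        {result, Key, Value} ->                %% notify every waiter
            [Pid ! {rendered, Key, Value} || Pid <- maps:get(Key, Waiting)],
            loop(maps:remove(Key, Waiting))
    end.
```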

\aosasectii{Prevent Overload on Server Start or Restart}

Zotonic introduces several bottlenecks on purpose. These bottlenecks
limit the access to processes that use limited resources or are
expensive (in terms of CPU or memory) to perform. Bottlenecks are
currently set up for the template compiler, the image resizing process,
and the database connection pool.

The bottlenecks are implemented by having a limited worker pool for
performing the requested action. For CPU or disk intensive work, like
image resizing, there is only a single process handling the requests.
Requesting processes post their request in the Erlang request queue for
the process and wait until their request is handled. If a request times
out it will just crash. Such a crashing request will return HTTP status
503 \emph{Service Unavailable}.

Waiting processes don't use many resources and the bottlenecks protect
against overload if a template is changed or an image on a hot page is
replaced and needs cropping or resizing.

In short: a busy server can still dynamically update its templates,
content and images without getting overloaded. At the same time it
allows single requests to crash while the system itself continues to
operate.

\aosasectii{The Database Connection Pool}

One more word on database connections. In Zotonic a process fetches a
database connection from a pool of connections for every single query or
transaction. This enables many concurrent processes to share a very
limited number of database connections. Compare this with most (PHP)
systems, where every request holds a database connection for its
complete duration.

Zotonic closes unused database connections after a time of inactivity.
One connection is always left open so that the system can always handle
an incoming request or background activity quickly. The dynamic
connection pool drastically reduces the number of open database
connections on most Zotonic web sites to one or two.
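The per-query checkout pattern can be sketched like this (hypothetical code; the helper functions \texttt{pool\_checkout}, \texttt{pool\_checkin} and \texttt{run\_query} are illustrative, not Zotonic's actual API):

```erlang
%% Hypothetical sketch: a connection is held only for the duration of
%% one query and is always returned to the pool, even if the query
%% crashes.
q(Pool, Sql, Args) ->
    Conn = pool_checkout(Pool),        %% may wait for a free connection
    try
        run_query(Conn, Sql, Args)
    after
        pool_checkin(Pool, Conn)       %% always return the connection
    end.
```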

\aosasecti{Caching Layers}

The hardest part of caching is cache invalidation: keeping the cached
data fresh and purging stale data. Zotonic uses a central caching
mechanism with dependency checks to solve this problem.

This section describes Zotonic's caching mechanism in a top-down
fashion: from the browser down through the stack to the database.

\aosasectii{Client-Side Caching}

Client-side caching is done by the browser, which caches images, CSS
and JavaScript files. Zotonic does not allow client-side caching of
HTML pages; it always generates all pages dynamically. This is
acceptable because Zotonic is very efficient at generating pages (as
described in the previous section), and not caching HTML pages prevents
stale pages from being shown after users log in or log out, or after
comments are placed.

Zotonic improves client-side performance in two ways:

\begin{aosaenumerate}
\def\labelenumi{\arabic{enumi}.}

\item
  It allows caching of static files (CSS, JavaScript, images, etc.).
\item
  It combines multiple CSS or JavaScript files into a single response.
\end{aosaenumerate}

\noindent The first is done by adding the appropriate HTTP headers to the
response\footnote{Note that Zotonic does not set an ETag. Some browsers
  check the ETag for every use of the file by making a request to the
  server, which defeats the whole idea of caching and making fewer
  requests.}:

\begin{verbatim}
Last-Modified: Tue, 18 Dec 2012 20:32:56 GMT
Expires: Sun, 01 Jan 2023 14:55:37 GMT
Date: Thu, 03 Jan 2013 14:55:37 GMT
Cache-Control: public, max-age=315360000
\end{verbatim}

\noindent Multiple CSS or JavaScript files are concatenated into a single file,
separating individual files by a tilde and only mentioning paths if they
change between files:

\begin{verbatim}
http://example.org/lib/bootstrap/css/bootstrap
  ~bootstrap-responsive~bootstrap-base-site~
  /css/jquery.loadmask~z.growl~z.modal~site~63523081976.css
\end{verbatim}

\noindent The number at the end is a timestamp of the newest file in the list. The
necessary CSS link or JavaScript script tag is generated using the
\texttt{\{\% lib \%\}} template tag.
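In a template this looks roughly as follows (the file names are illustrative):

```dtl
{# Combines the listed files into one cacheable request. #}
{% lib
    "bootstrap/css/bootstrap.css"
    "css/jquery.loadmask.css"
    "css/site.css"
%}
```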

\aosasectii{Server-Side Caching}

Zotonic is a large system, and many of its parts perform caching in
some way. The sections below explain some of the more interesting ones.

\aosasectii{Static CSS, JS and Image Files}

The controller serving static files has several optimizations. It can
decompose combined file requests into a list of individual files.

The controller has checks for the \texttt{If-Modified-Since} header,
serving the HTTP status 304 \emph{Not Modified} when appropriate.

On the first request it will concatenate the contents of all the static
files into one byte array (an Erlang \emph{binary}).\footnote{A byte
  array, or binary, is a native Erlang data type. If it is smaller than
  64 bytes it is copied between processes, larger ones are shared
  between processes. Erlang also shares parts of byte arrays between
  processes with references to those parts and not copying the data
  itself, thus making these byte arrays an efficient and easy to use
  data type.} This byte array is then cached in the central depcache
(see \aosasecref{posa.zotonic.depcache}) in two forms: compressed (with
gzip) and uncompressed. Depending on the \texttt{Accept-Encoding}
headers sent by the browser, Zotonic will serve either the compressed or
uncompressed version.
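Selecting between the two cached variants can be sketched as (hypothetical code, not the actual controller):

```erlang
%% Hypothetical sketch: serve the gzip-compressed cached binary if the
%% client's Accept-Encoding header mentions gzip, the plain one
%% otherwise.
pick_variant(AcceptEncoding, Gzipped, Plain) ->
    case binary:match(AcceptEncoding, <<"gzip">>) of
        nomatch -> {<<"identity">>, Plain};
        _Found  -> {<<"gzip">>, Gzipped}
    end.
```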

This caching mechanism is efficient enough that its performance is
similar to many caching proxies, while still fully controlled by the web
server. With an earlier version of Zotonic and on simple hardware (quad
core 2.4 GHz Xeon from 2008) we saw throughputs of around 6000
requests/second and were able to saturate a gigabit ethernet connection
requesting a small (\textasciitilde{}20 KB) image file.

\aosasectii{Rendered Templates}

Templates are compiled into Erlang modules, after which the byte code is
kept in memory. Compiled templates are called as regular Erlang
functions.

The template system detects any changes to templates and will recompile
the template during runtime. When compilation is finished Erlang's hot
code upgrade mechanism is used to load the newly compiled Erlang module.

The main page and template controllers have options to cache the
template rendering result. Caching can also be enabled only for
anonymous (not logged in) visitors. For most websites, anonymous
visitors generate the bulk of all requests, and their pages are not
personalized and thus (almost) identical. Note that the template
rendering result is an intermediate structure and not the final HTML:
it contains, among other things, untranslated strings and JavaScript
fragments. The final HTML is generated by traversing this intermediate
structure, picking the correct translations and collecting all
JavaScript.

The concatenated JavaScript, along with a unique page ID, is placed at
the position of the \linebreak \texttt{\{\% script \%\}} template tag.
This tag should be placed just above the closing
\texttt{\textless{}/body\textgreater{}} tag. The unique page ID is
used to match this rendered page with the handling Erlang processes and
for WebSocket/Comet interaction on the page.

As with any template language, templates can include other templates.
In Zotonic, included templates are usually compiled inline to eliminate
any performance loss from using included files.

Special options can force runtime inclusion. One of those options is
caching. Caching can be enabled for anonymous visitors only, a caching
period can be set, and cache dependencies can be added. These cache
dependencies are used to invalidate the cached rendering if any of the
shown resources is changed.

Another method to cache parts of templates is to use the
\texttt{\{\% cache \%\} ... \{\% endcache \%\}} block tag, which caches
a part of a template for a given amount of time. This tag has the same
caching options as the include tag, but has the advantage that it can
easily be added in existing templates.
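A hypothetical use of the tag, caching an included sidebar for an hour (the argument syntax here is illustrative; see the Zotonic documentation for the exact caching options):

```dtl
{% cache 3600 id %}
    {% include "_sidebar.tpl" id=id %}
{% endcache %}
```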

\aosasectii{In-Memory Caching}

All caching is done in memory, in the Erlang VM itself. No communication
between computers or operating system processes is needed to access the
cached data. This greatly simplifies and optimizes the use of the cached
data.

As a comparison, accessing a memcache server typically takes 0.5
milliseconds. In contrast, accessing main memory within the same process
takes 1 nanosecond on a CPU cache hit and 100 nanoseconds on a CPU
cache miss---not to mention the huge speed difference between memory and
network.\footnote{See ``Latency Numbers Every Programmer Should Know''
  at
  \newline \texttt{http://www.eecs.berkeley.edu/\textasciitilde{}rcs/research/interactive\_latency.html}.}

Zotonic has two in-memory caching mechanisms\footnote{In addition to
  these mechanisms, the database server performs some in-memory caching,
  but that is not within the scope of this chapter.}:

\begin{aosaenumerate}
\def\labelenumi{\arabic{enumi}.}

\item
  Depcache, the central per-site cache
\item
  Process Dictionary Memo Cache
\end{aosaenumerate}

\aosasectii{Depcache}

\label{posa.zotonic.depcache}

The central caching mechanism in every Zotonic site is the
\emph{depcache}, which is short for \emph{dep}endency \emph{cache}. The
depcache is an in-memory key-value store with a list of dependencies for
every stored key.

For every key in the depcache we store:

\begin{aosaitemize}

\item
  the key's value;
\item
  a serial number, a global integer incremented with every update
  request;
\item
  the key's expiration time (counted in seconds);
\item
  a list of other keys that this key depends on (e.g., a resource ID
  displayed in a \linebreak cached template); and
\item
  if the key is still being calculated, a list of processes waiting for
  the key's value.
\end{aosaitemize}

When a key is requested, the cache checks whether the key is present,
not expired, and whether the serial numbers of all its dependency keys
are lower than the serial number of the cached key. If the key is still
valid, its value is returned; otherwise the key and its value are
removed from the cache and \texttt{undefined} is returned.

Alternatively, if the key is still being calculated, the requesting
process is added to the key's waiting list.
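The validity check can be sketched as follows (hypothetical code; the real logic lives in Zotonic's depcache module):

```erlang
%% Hypothetical sketch of the depcache validity check: a cached entry
%% is valid if it has not expired and none of its dependencies has a
%% serial number newer than the entry's own serial.
is_valid(#meta{expire = Expire, serial = Serial, deps = Deps}, Now, DepsTable) ->
    Expire > Now andalso
        lists:all(
            fun(DepKey) ->
                case ets:lookup(DepsTable, DepKey) of
                    [{DepKey, DepSerial}] -> DepSerial =< Serial;
                    []                    -> false   %% dependency flushed
                end
            end,
            Deps).
```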

The implementation makes use of ETS, the Erlang Term Storage, a standard
hash table implementation which is part of the Erlang OTP distribution.
The following ETS tables are created by Zotonic for the depcache:

\begin{aosaitemize}

\item
  Meta table: the ETS table holding all stored keys, the expiration and
  the depending keys. A record in this table is written as
  \texttt{\#meta\{key, expire, serial, deps\}}.
\item
  Deps table: the ETS table that stores the serial for each key.
\item
  Data table: the ETS table that stores each key's data.
\item
  Waiting PIDs dictionary: the ETS table that stores the IDs of all
  processes waiting for the arrival of a key's value.
\end{aosaitemize}

The ETS tables are optimized for parallel reads and usually directly
accessed by the calling process. This prevents any communication between
the calling process and the depcache process.

The depcache process is called for:

\begin{aosaitemize}

\item
  memoization where processes wait for another process's value to be
  calculated;
\item
  \emph{put} (store) requests, serializing the serial number increments;
  and
\item
  delete requests, also serializing the depcache access.
\end{aosaitemize}

The depcache can get quite large. To prevent it from growing too large
there is a garbage collector process. The garbage collector slowly
iterates over the complete depcache, evicting expired or invalidated
keys. If the depcache size is above a certain threshold (100 MiB by
default) then the garbage collector speeds up and evicts 10\% of all
encountered items. It keeps evicting until the cache is below its
threshold size.

100 MiB might sound small in this era of multi-terabyte databases.
However, as the cache mostly contains textual data, it will be big
enough to hold the hot data for most web sites. If not, the size of the
cache can be changed in the configuration.

\aosasectii{Process Dictionary Memo Cache}

The other memory-caching paradigm in Zotonic is the process dictionary
memo cache. As described earlier, the data access patterns are dictated
by the templates. The caching system uses simple heuristics to optimize
access to data.

Important in this optimization is data caching in the Erlang process
dictionary of the process handling the request. The process dictionary
is a simple key-value store in the same heap as the process. Basically,
it adds state to the functional Erlang language. Use of the process
dictionary is usually frowned upon for this reason, but for in-process
caching it is useful.

When a resource is accessed (remember, a resource is the central data
unit of Zotonic), it is copied into the process dictionary. The same is
done for computational results---like access control checks---and other
data like configuration values.

Every time a property of a resource---like its title, summary or body
text---is shown on a page, an access control check must be performed
and the property fetched from the resource. Caching all the resource's
properties and its access checks greatly speeds up resource data usage
and removes many drawbacks of the hard-to-predict data access patterns
of templates.

As a page or process can use a lot of data, this memo cache has a
couple of pressure valves:

\begin{aosaenumerate}
\def\labelenumi{\arabic{enumi}.}

\item
  When holding more than 10,000 keys, the whole process dictionary is
  flushed. This prevents the process dictionary from holding many unused
  items, as can happen when looping through long lists of resources.
  Special Erlang variables like \texttt{\$ancestors} are kept.
\item
  The memo cache must be programmatically enabled. This is automatically
  done for every incoming HTTP or WebSocket request and template
  rendering.
\item
  Between HTTP/WebSocket requests the process dictionary is flushed, as
  multiple sequential HTTP/WebSocket requests share the same process.
\item
  The memo cache doesn't track dependencies. Any depcache deletion will
  also flush the complete process dictionary of the process performing
  the deletion.
\end{aosaenumerate}

When the memo cache is disabled, every lookup is handled by the
depcache. This results in a call to the depcache process and data
copying between the depcache and the requesting process.
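The lookup order can be sketched as (hypothetical code; \texttt{depcache\_get/2} stands in for the real depcache call):

```erlang
%% Hypothetical sketch of the process dictionary memo cache: check the
%% calling process's own dictionary first and fall back to the shared
%% depcache, remembering the answer locally for later lookups.
memo_get(Key, Context) ->
    case erlang:get({memo, Key}) of
        undefined ->
            Value = depcache_get(Key, Context),  %% illustrative fallback
            erlang:put({memo, Key}, Value),
            Value;
        Value ->
            Value
    end.
```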

\aosasecti{The Erlang Virtual Machine}

The Erlang Virtual Machine has a few properties that are important when
looking at performance.

\aosasectii{Processes are Cheap}

The Erlang VM is specifically designed to do many things in parallel,
and as such has its own implementation of multiprocessing within the VM.
Erlang processes are scheduled on a reduction count basis, where one
reduction is roughly equivalent to a function call. A process is allowed
to run until it pauses to wait for input (a message from some other
process) or until it has executed a fixed number of reductions. For each
CPU core, a scheduler is started with its own run queue. It is not
uncommon for Erlang applications to have thousands to millions of
processes alive in the VM at any given point in time.

Processes are not only cheap to start but also cheap in memory at 327
words per process, which amounts to \textasciitilde{}2.5 KiB on a 64 bit
machine.\footnote{See
  \texttt{http://www.erlang.org/doc/efficiency\_guide/advanced.html\#id68921}}
This compares to \textasciitilde{}500 KiB for Java and a default of 2
MiB for pthreads.

Since processes are so cheap to use, any processing that is not needed
for a request's result is spawned off into a separate process. Sending
an email or logging are both examples of tasks that could be handled by
separate processes.
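In code this pattern is as simple as (function names are hypothetical):

```erlang
%% Hypothetical sketch: work that does not affect the response is
%% spawned into its own lightweight process, so the request process can
%% reply immediately.
handle_request(Req) ->
    Reply = render_page(Req),
    spawn(fun() -> send_confirmation_email(Req) end),  %% fire and forget
    spawn(fun() -> log_request(Req) end),
    Reply.
```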

\aosasectii{Data Copying is Expensive}

In the Erlang VM, messages between processes are relatively expensive,
as the message is copied to the receiving process. This copying is
needed because of Erlang's per-process garbage collector. Preventing
unnecessary data copying is important, which is why Zotonic's depcache
uses ETS tables, which can be read directly from any process.

\aosasectiii{Separate Heap for Bigger Byte Arrays}

There is one big exception to copying data between processes: byte
arrays larger than 64 bytes are not copied. They have their own heap
and are garbage collected separately.

This makes it cheap to send a big byte array between processes, as only
a reference to the byte array is copied. However, it does make garbage
collection harder, as all references must be garbage collected before
the byte array can be freed.

Sometimes, references to parts of a big byte array are passed around:
the larger byte array cannot be garbage collected until the reference to
the smaller part has been collected. Consequently, explicitly copying a
small part can be an optimization if that frees up the larger byte
array.
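This can be demonstrated directly. In the sketch below, `Head` is a sub-binary referencing `Big`; `binary:copy/1` produces an independent copy, after which nothing keeps the large binary alive:

```erlang
%% A 1 MiB binary lives on the shared binary heap.
Big = binary:copy(<<"x">>, 1024 * 1024),
%% Pattern matching yields a sub-binary: a reference into Big,
%% not a copy of the ten bytes.
<<Head:10/binary, _/binary>> = Big,
%% As long as Head is referenced, all of Big stays alive. An explicit
%% copy breaks that link, so Big can be garbage collected.
Independent = binary:copy(Head),
10 = byte_size(Independent).
```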

\aosasectii{String Processing is Expensive}

String processing in any functional language can be expensive because
strings are often represented as linked lists of integers, and, due to
the functional nature of Erlang, data cannot be destructively updated.

If a string is represented as a list, then it is processed using tail
recursive functions and pattern matching. This makes it a natural fit
for functional languages. The problem is that the data representation of
a linked list has a big overhead and that messaging a list to another
process always involves copying the full data structure. This makes a
list a non-optimal choice for strings.

Erlang has its own middle-of-the-road answer to strings: io-lists.
Io-lists are nested lists containing other lists, integers (single-byte
values), byte arrays, and references to parts of other byte arrays.
Io-lists are extremely easy to use and appending, prefixing or inserting
data is inexpensive, as they only need changes to relatively short
lists, without any data copying.\footnote{Erlang can also \emph{share}
  parts of a byte array with references to those parts, thus
  circumventing the need to copy that data. An insert into a byte array
  can be represented by an io-list of three parts: a reference to the
  unchanged head bytes, the inserted value, and a reference to the
  unchanged tail bytes.}

An io-list can be sent as-is to a ``port'' (a file descriptor), which
flattens the data structure to a byte stream and sends it to a socket.

Example of an io-list:

\begin{verbatim}
 [ <<"Hello">>, 32, [ <<"Wo">>, [114, 108], <<"d">> ] ].
\end{verbatim}

\noindent which flattens to the byte array:

\begin{verbatim}
 <<"Hello World">>.
\end{verbatim}

\noindent Interestingly, most string processing in a web application consists of:

\begin{aosaenumerate}
\def\labelenumi{\arabic{enumi}.}

\item
  Concatenating data (dynamic and static) into the resulting page.
\item
  HTML escaping and sanitizing content values.
\end{aosaenumerate}

Erlang's io-list is the perfect data structure for the first use case.
The second use case is handled by aggressively sanitizing all content
\emph{before} it is stored in the database.

Combined, these two mean that in Zotonic a rendered page is just one big
concatenation of byte arrays and pre-sanitized values in a single
io-list.
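Because appending only wraps the existing structure in a new list, building a page this way never copies the accumulated data. A small sketch:

```erlang
%% Appending to an io-list is cheap: the old list is reused, not copied.
Page0 = [<<"<html><body>">>],
Page1 = [Page0, <<"Hello">>],
Page2 = [Page1, <<"</body></html>">>],
%% Flattening happens once, at the very end (or inside the port
%% when the io-list is written to a socket).
<<"<html><body>Hello</body></html>">> = iolist_to_binary(Page2).
```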

\aosasectii{Implications for Zotonic}

Zotonic makes heavy use of a relatively big data structure, the
\emph{Context}. This is a record containing all data needed for a
request evaluation. It contains:

\begin{aosaitemize}

\item
  The request data: headers, request arguments, body data etc.
\item
  Webmachine status
\item
  User information (e.g., user ID, access control information)
\item
  Language preference
\item
  \texttt{User-Agent} class (e.g., text, phone, tablet, desktop)
\item
  References to special site processes (e.g., notifier, depcache, etc.)
\item
  Unique ID for the request being processed (this will become the page
  ID)
\item
  Session and page process IDs
\item
  Database connection process during a transaction
\item
  Accumulators for reply data (e.g., data, actions to be rendered,
  JavaScript files)
\end{aosaitemize}

All this data can make a large data structure. Sending this large
Context to different processes working on the request would result in a
substantial data copying overhead.

That is why we try to do most of the request processing in a single
process: the Mochiweb process that accepted the request. Additional
modules and extensions are called using function calls instead of using
inter-process messages.

Sometimes an extension is implemented using a separate process. In that
case the extension provides a function accepting the Context and the
process ID of the extension process. This interface function is then
responsible for efficiently messaging the extension process.

Zotonic also needs to send a message when rendering cacheable
sub-templates. In this case the Context is pruned of all intermediate
template results and some other unneeded data (like logging information)
before the Context is messaged to the process rendering the
sub-template.

We don't care too much about messaging byte arrays as they are, in most
cases, larger than 64 bytes and as such will not be copied between
processes.

For serving large static files, there is the option of using the Linux
\texttt{sendfile()} system call to delegate sending the file to the
operating system.

\aosasecti{Changes to the Webmachine Library}

Webmachine is a library implementing an abstraction of the HTTP
protocol. It is implemented on top of the Mochiweb library which
implements the lower level HTTP handling, like acceptor processes,
header parsing, etc.

Controllers are made by creating Erlang modules implementing callback
functions. Examples of callback functions are \texttt{resource\_exists},
\texttt{previously\_existed}, \texttt{authorized},
\texttt{allowed\_methods}, \texttt{process\_post}, etc. Webmachine also
matches request paths against a list of dispatch rules, assigning
request arguments and selecting the correct controller for handling the
HTTP request.

With Webmachine, handling the HTTP protocol becomes easy. We decided
early on to build Zotonic on top of Webmachine for this reason.

While building Zotonic a couple of problems with Webmachine were
encountered.

\begin{aosaenumerate}
\def\labelenumi{\arabic{enumi}.}

\item
  When we started, it supported only a single list of dispatch rules;
  not a list of rules per host (i.e., site).
\item
  Dispatch rules are set in the application environment, and copied to
  the request process when dispatching.
\item
  Some callback functions (like \texttt{last\_modified}) are called
  multiple times during request evaluation.
\item
  When Webmachine crashes during request evaluation no log entry is made
  by the request logger.
\item
  No support for HTTP Upgrade, making WebSockets support harder.
\end{aosaenumerate}

The first problem (no partitioning of dispatch rules) is only a
nuisance. It makes the list of dispatch rules less intuitive and more
difficult to interpret.

The second problem (copying the dispatch list for every request) turned
out to be a show stopper for Zotonic. The list could become so large
that copying it took up the majority of the time needed to handle a
request.

The third problem (multiple calls to the same functions) forced
controller writers to implement their own caching mechanisms, which is
error prone.

The fourth problem (no log on crash) makes it harder to see problems
when in production.

The fifth problem (no HTTP Upgrade) prevents us from using the nice
abstractions available in Webmachine for WebSocket connections.

The above problems were so serious that we had to modify Webmachine for
our own purposes.

First, a new option was added: the dispatcher. A dispatcher is a module
implementing the \texttt{dispatch/3} function, which matches a request
against a dispatch list. The dispatcher also selects the correct site
(virtual host) using the HTTP \texttt{Host} header. When testing a
simple ``hello world'' controller, these changes gave a threefold
increase in throughput. We also observed that the gain was much higher
on systems with many virtual hosts and dispatch rules.
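The core of such path matching can be sketched with a named fun. This is an illustrative simplification, not Zotonic's actual \texttt{dispatch/3} implementation: literal segments must match exactly, while atoms bind the corresponding path segment as a request argument.

```erlang
%% Match a split request path against one dispatch-rule pattern.
%% Returns {ok, Bindings} on success, fail otherwise.
Match = fun Match([], [], Bindings) ->
                {ok, lists:reverse(Bindings)};
            Match([Seg | Path], [Seg | Rule], Bindings) ->
                Match(Path, Rule, Bindings);               % literal segment
            Match([Seg | Path], [Var | Rule], Bindings) when is_atom(Var) ->
                Match(Path, Rule, [{Var, Seg} | Bindings]); % bind argument
            Match(_, _, _) ->
                fail
        end,
{ok, [{id, <<"42">>}]} = Match([<<"page">>, <<"42">>], [<<"page">>, id], []),
fail = Match([<<"admin">>], [<<"page">>, id], []).
```

A real dispatcher tries each rule in turn and also handles trailing wildcards, but the per-rule matching follows this shape.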

Webmachine maintains two data structures: one for the request data and
one for the internal request processing state. These data structures
referred to each other and were almost always used in tandem, so we
combined them into a single data structure. This made it easier to
remove the use of the process dictionary and to pass the new combined
data structure as an argument to all functions inside Webmachine. The
change resulted in 20\% less processing time per request.

We optimized Webmachine in many other ways that we will not describe in
detail here, but the most important points are:

\begin{aosaitemize}

\item
  Return values of some controller callbacks are cached
  (\texttt{charsets\_provided},\linebreak \texttt{content\_types\_provided},
  \texttt{encodings\_provided}, \texttt{last\_modified}, and
  \texttt{generate\_etag}).
\item
  More process dictionary use was removed (less global state, clearer
  code, easier testing).
\item
  Separate logger process per request; even when a request crashes we
  have a log up to the point of the crash.
\item
  An HTTP Upgrade callback was added as a step after the
  \emph{forbidden} access check to support WebSockets.
\item
  Originally, a controller was called a ``resource''. We changed it to
  ``controller'' to make a clear distinction between the
  (data-)resources being served and the code serving those resources.
\item
  Some instrumentation was added to measure request speed and size.
\end{aosaitemize}

\aosasecti{Data Model: a Document Database in SQL}

\label{posa.zotonic.db}

From a data perspective it is worth mentioning that all properties of a
``resource'' (Zotonic's main data unit) are serialized into a binary
blob; ``real'' database columns are only used for keys, querying and
foreign key constraints.

Separate ``pivot'' fields and tables are added for properties, or
combinations of properties, that need indexing, such as full-text
columns, date properties, etc.

When a resource is updated, a database trigger adds the resource's ID to
the pivot queue. This pivot queue is consumed by a separate Erlang
background process which indexes batches of resources at a time in a
single transaction.

Choosing SQL made it possible for us to hit the ground running:
PostgreSQL has a well known query language, great stability, known
performance, excellent tools, and both commercial and non-commercial
support.

Beyond that, the database is not the limiting performance factor in
Zotonic. If a query becomes the bottleneck, then it is the task of the
developer to optimize that particular query using the database's query
analyzer.

Finally, the golden performance rule for working with any database is:
Don't hit the database; don't hit the disk; don't hit the network; hit
your cache.

\aosasecti{Benchmarks, Statistics and Optimizations}

We don't put too much faith in benchmarks, as they often test only
minimal parts of a system and don't represent the performance of the
system as a whole. This is especially true for a system with many moving
parts, and in Zotonic the caching system and the handling of common
access patterns are an integral part of the design.

\aosasectii{A Simplified Benchmark}

What a benchmark \emph{might do} is show where you could optimize the
system first.

With this in mind, we benchmarked Zotonic using the TechEmpower JSON
benchmark, which basically tests the request dispatcher, the JSON
encoder, HTTP request handling, and the TCP/IP stack.

The benchmark was performed on an Intel i7 quad core M620 @ 2.67 GHz. The
command was \texttt{wrk -c 3000 -t 3000 http://localhost:8080/json}. The
results are shown in \aosatblref{posa.zotonic.bmark}.

\begin{table}[h!]
\centering
{\footnotesize
\rowcolors{2}{TableOdd}{TableEven}
\begin{tabular}{lr}
\hline
\textbf{Platform}
& \textbf{x1000 Requests/sec}
\\
\hline
Node.js
& 27
\\
Cowboy (Erlang)
& 31
\\
Elli (Erlang)
& 38
\\
Zotonic
& 5.5
\\
Zotonic w/o access log
& 7.5
\\
Zotonic w/o access log, with dispatcher pool
& 8.5
\\
\hline
\end{tabular}
}
\caption{Benchmark Results}
\label{posa.zotonic.bmark}
\end{table}

Zotonic's dynamic dispatcher and HTTP protocol abstraction give it lower
scores in such a micro benchmark. These are relatively easy to fix, and
the solutions were already planned:

\begin{aosaitemize}

\item
  Replace the standard webmachine logger with a more efficient one
\item
  Compile the dispatch rules in an Erlang module (instead of a single
  process interpreting the dispatch rule list)
\item
  Replace the MochiWeb HTTP handler with the Elli HTTP handler
\item
  Use byte arrays in Webmachine instead of the current character lists
\end{aosaitemize}

\aosasectii{Real-Life Performance}

For the 2013 abdication of the Dutch queen and subsequent inauguration
of the new Dutch king a national voting site was built using Zotonic.
The client requested 100\% availability and high performance, being able
to handle 100,000 votes per hour.

The solution was a system with four virtual servers, each with 2 GB RAM
and running their own independent Zotonic system. Three nodes handled
voting, one node was for administration. All nodes were independent but
the voting nodes shared every vote with at least two other nodes, so
no vote would be lost if a node crashed.

A single vote involved \textasciitilde{}30 HTTP requests for dynamic
HTML (in multiple languages), Ajax, and static assets like CSS and
JavaScript. Multiple requests were needed for selecting the three
projects to vote on and for filling in the voter's details.

When tested, we easily met the customer's requirements without pushing
the system to its maximum. The voting simulation was stopped at 500,000
complete voting procedures per hour, using around 400 Mbps of bandwidth,
with 99\% of request handling times below 200 milliseconds.

From the above it is clear that Zotonic can handle popular dynamic web
sites. On real hardware we have observed much higher performance still,
especially in the underlying I/O and database layers.

\aosasecti{Conclusion}

When building a content management system or framework it is important
to take the full stack of your application into consideration, from the
web server, the request handling system, the caching systems, down to
the database system. All parts must work well together for good
performance.

Much performance can be gained by preprocessing data. An example of
preprocessing is pre-escaping and sanitizing data before storing it into
the database.

Caching hot data is a good strategy for web sites with a clear set of
popular pages followed by a long tail of less popular pages. Placing
this cache in the same memory space as the request handling code gives a
clear edge over using separate caching servers, both in speed and
simplicity.

Another optimization for handling sudden bursts in popularity is to
dynamically match similar requests and process them once for the same
result. When this is implemented well, a caching proxy can be avoided
and all HTML pages can be generated dynamically.

Erlang is a great match for building dynamic web based systems due to
its lightweight multiprocessing, failure handling, and memory
management.

Using Erlang, Zotonic makes it possible to build a very competent and
well-performing content management system and framework without needing
separate web servers, caching proxies, memcache servers, or e-mail
handlers. This greatly simplifies system management tasks.

On current hardware a single Zotonic server can handle thousands of
dynamic page requests per second, thus easily serving the vast majority
of web sites on the world wide web.

Using Erlang, Zotonic is prepared for the future of multi-core systems
with dozens of cores and many gigabytes of memory.

\aosasecti{Acknowledgements}

The authors would like to thank Michiel Klønhammer (Maximonster
Interactive Things), Andreas Stenius, Maas-Maarten Zeeman and Atilla
Erdődi.

\end{aosachapter}
